ABSTRACT

Three assumptions underlying the use of norm-referenced tests are examined:
(1) that expressing treatment effects in a standard score metric permits
aggregation of effects across grades; (2) that commonly used standardized
tests are sufficiently comparable to permit aggregation of results across
tests; and (3) that the summer loss of achievement observed in Title I
projects is due to an actual loss in achievement and skills. Hypotheses
regarding the standardized growth expectation (SGE) are also presented. SGE
refers to the amount of growth (expressed in standard deviation form) that a
student must demonstrate over the treatment interval to maintain standing in
the norm group. SGE may also be conceptualized as the difference between the
pretest percentile and the posttest percentile. Hypotheses are presented
regarding the decrease in SGE which accompanies grade increases, and the
variation in SGE according to the achievement test used. Further research
topics investigating the validity of the SGE phenomenon are suggested.
(Author/GDC)
The purpose of this paper is to review some assumptions underlying
the use of norm-referenced tests in educational evaluations and to provide a
prospectus for research on these assumptions as well as other questions
related to norm-referenced tests. Specifically, the assumptions which will
be examined are (1) expressing treatment effects in a standard score metric
permits aggregation of effects across grades, (2) commonly used standardized
tests are sufficiently comparable to permit aggregation of results across
tests, and (3) the summer loss observed in Title I projects is due to an
actual loss in achievement skills and knowledge. We wish to emphasize at
the outset that our intent in this paper is to raise questions and not to
present a coherent set of answers.

Throughout this paper we make use of an index termed the "standardized
growth expectation" (SGE). The SGE is defined to be the amount of growth
(expressed in standard deviation form) that a student must demonstrate over
a given treatment interval to maintain his/her relative standing in the norm
group (Stenner et al., 1977). The SGE rests on the assumption that a student
will attain the same raw score on the pretest and posttest if no learning
has taken place between testings. If the pretest raw score is equivalent to
a national percentile of 50 and the same raw score is entered into the
corresponding posttest percentile table, the resulting percentile score will
be less than 50. The difference between the pretest percentile and the
posttest percentile expressed in standard score form is termed the SGE.
Stated another way, the SGE is the amount that a student at a particular
pretest percentile is assumed to learn over a period of time or, conversely,
the loss in relative standing that such a student would suffer if he/she
learned nothing during the time period.

An example may help to clarify the procedures used to calculate the
SGE. Table 1 presents a raw score to percentile conversion table for
beginning of first grade and end of first grade on the Total Reading scale of
the Comprehensive Test of Basic Skills, Form S. The average (50th percentile)
beginning first grade student attains a raw score of 31 on Total Reading.
Under the assumption that this illustrative average student learns nothing
in the first grade, he/she would be expected to again obtain a raw score
of 31 on the posttest. Whereas a raw score of 31 is equivalent to a
beginning first grade percentile of 50, it represents an end of first grade
percentile of 9. If both percentiles are converted to Z scores and subtracted,
the result is an SGE of 1.39 (i.e., the 50th percentile equals a Z score of
zero, whereas the 9th percentile equals a Z score of -1.39). In other words,
if an average student learns nothing about reading during the first grade,
he/she would be expected to lose 1.39 standard deviation units in relative
standing because that is the amount of standardized growth exhibited by the
national norm group during the first grade.¹
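
The calculation just described can be sketched in a few lines of Python. This
is a minimal illustration and not the authors' procedure: the function names
are hypothetical, and the sketch assumes that percentiles are converted to Z
scores through the inverse of the standard normal distribution. Under that
assumption the 9th percentile converts to a Z score of about -1.34, so the
sketch gives an SGE near 1.34 rather than the 1.39 reported above; the
difference presumably reflects the conversion table or an unrounded
percentile used in the original computation.

```python
from statistics import NormalDist


def percentile_to_z(percentile: float) -> float:
    """Convert a national percentile (between 0 and 100) to a normal-curve Z score."""
    if not 0 < percentile < 100:
        raise ValueError("percentile must lie strictly between 0 and 100")
    return NormalDist().inv_cdf(percentile / 100)


def standardized_growth_expectation(pre_percentile: float, post_percentile: float) -> float:
    """SGE: pretest Z minus the Z that the same raw score earns on the posttest norms."""
    return percentile_to_z(pre_percentile) - percentile_to_z(post_percentile)


# Example from the text: a raw score of 31 is the 50th percentile at the
# beginning of first grade and the 9th percentile at the end of first grade.
sge_first_grade = standardized_growth_expectation(50, 9)
print(f"SGE = {sge_first_grade:.2f} standard deviation units")
```

The same two functions apply to any pretest percentile, provided the posttest
percentile for the unchanged raw score is read from the publisher's
conversion table.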

Grade-to-Grade Variation in SGEs

Some educational evaluations which employ norm-referenced achievement
tests share a common assumption, namely, that observed treatment effects
(e.g., differences between standardized means of observed treatment group
posttest scores and expected treatment group posttest scores) are comparable
across grades. Stated another way, it has been assumed that a one-third
standard deviation difference between experimental and control students'
reading comprehension has the same meaning whether observed at the second,
fifth, or seventh grade levels. It has also been assumed, with apparent logic,
that if a special program consistently demonstrates larger treatment effects
in the primary as opposed to intermediate grades, then compensatory efforts
should be concentrated at the lower level. In fact, the twelve-year history
of ESEA Title I documents a nationwide trend toward focusing increasing
amounts of compensatory education efforts on primary students. Numerous
evaluation studies have supported this movement through findings that larger
treatment effects are possible with younger students. One question raised
in this paper is whether or not there is a built-in bias in our evaluation
methodology and/or instrumentation that ensures finding more "exemplary"
programs at the primary grade levels.

We raise this issue of cross grade comparisons because educational
policy may rest upon the legitimacy of just such comparisons. For example,
a review of seventy-three school desegregation studies concluded that
the critical period for desegregation (to maximize black students'
achievement) is prior to third grade (Crain and Mahard, 1977). Black students
desegregated beyond the third grade tend to show lower achievement test gains.
Of the ten studies involving first and second graders, eight showed that
desegregation produced higher achievement levels and two showed no effect.
Only nine of twenty-one studies showed higher achievement among third and
fourth grade students, and for students in grades five through twelve, only
sixteen of thirty-one studies showed any achievement gain attributable to
desegregation. Eleven of the seventy-three studies reviewed by Crain and Mahard
were not analyzed by grade. Taken at face value, these findings suggest that
younger students benefit more from desegregation (in terms of achievement)
than do older students. Crain and Mahard (1977) conclude: "The review of
these studies is inconclusive or debatable on nearly every point except that
desegregation in the early grades is superior to desegregation in the later
grades" (p. 19). It is precisely this kind of conclusion based upon cross grade
comparisons of norm-referenced achievement tests that may be invalid.

¹ The SGE differs slightly depending upon where in the pretest distribution
the raw score is selected to be entered into the posttest percentile
distribution. This difference is of interest in its own right, but introducing
it in the present paper would unnecessarily confuse the presentation.

Table 2 presents standardized growth expectations for five commonly used
norm-referenced achievement tests. The full year SGEs show a consistent
decrement with each grade. The negative relationships between grade and SGE
hold for both reading and mathematics across all five tests. The difference
between second grade SGEs and eighth grade SGEs averaged across tests exceeds
one-half standard deviation. Interestingly, the largest losses in SGEs for
both Total Reading and Total Math occur during the third grade period. Several
recent studies on test score decline (cf. Wirtz, 1977) have concluded that
test scores begin to drop at the fourth grade level. It might be rewarding
to investigate the possibility that SGE decrements are causally implicated
in reported test score declines. The Stanford Achievement Test, for example,
exhibits almost a fifty percent decrement in Total Reading SGE from second to
third grade. Similarly, the ITBS, CAT-77, and CTBS-S all show decrements
approaching twenty-five percent.

Insight into the implications that these grade-to-grade differences
may have for educational evaluation is gained by realizing that a treatment
effect of one-third standard deviation (often employed as a threshold value
for an educationally meaningful or practically significant effect size)
represents a 33% increase above expectation for second graders on the MAT
Total Reading and a 200% increase above expectation for eighth graders. If
the ongoing instructional process is incapable of producing more than .15 SD's
of growth on MAT Total Reading among eighth graders, then it seems somewhat
unrealistic to expect an eighth grade compensatory program or desegregation
effort to demonstrate a treatment effect of .33 SD's.
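
The arithmetic behind this comparison can be sketched as follows. The
function name is hypothetical. The second grade SGE of 1.0 and the eighth
grade SGE of .15 for MAT Total Reading are the values quoted in the text; the
sketch simply expresses a treatment effect as a percentage of the growth
expected over the same interval. With an SGE of exactly .15 the result is
about 220%, of the same order as the 200% cited above, which would correspond
to an SGE nearer .165.

```python
def percent_above_expectation(effect_sd: float, sge_sd: float) -> float:
    """Express a treatment effect as a percentage of the standardized growth expectation."""
    if sge_sd <= 0:
        raise ValueError("the SGE must be positive for this comparison")
    return 100 * effect_sd / sge_sd


TREATMENT_EFFECT = 0.33  # one-third SD criterion discussed in the text

# SGE values quoted in the text for MAT Total Reading.
second_grade = percent_above_expectation(TREATMENT_EFFECT, 1.0)
eighth_grade = percent_above_expectation(TREATMENT_EFFECT, 0.15)

print(f"Second grade: {second_grade:.0f}% above expectation")
print(f"Eighth grade: {eighth_grade:.0f}% above expectation")
```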



The conclusion implied above is that cross grade comparisons of
treatment effects expressed as national percentiles, standard scores, NCEs
or grade equivalents are usually inappropriate. Although treatment effects
expressed in standard score form may be smaller at the eighth grade level
than at the second grade level, the statistical significance (e.g., t ratio)
of these effects may be the same for both grade levels (assuming equal sample
sizes) because of the increased pre-post correlation at the eighth grade
level. Thus to apply an arbitrary treatment effect criterion of .33 SD's
(or any other uniformly applied criterion based upon standardized scores)
when screening for exemplary projects or reviewing research studies unfairly
discriminates against upper grade projects. A metric for critical or
educational significance which would be comparable across grades cannot be
formulated without consideration of the fact that pre-post correlations
increase with grade.²
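
One way to see the role of the pre-post correlation is to re-express a
treatment effect in units of the standard deviation of the pre-post
residuals. The sketch below is an illustration under stated assumptions, not
a procedure taken from the paper: it assumes a simple linear regression of
posttest on pretest, for which the residual standard deviation equals the
posttest standard deviation times sqrt(1 - r^2). The function names and the
two correlations (.60 and .90) are hypothetical.

```python
import math


def residual_sd(posttest_sd: float, pre_post_r: float) -> float:
    """SD of posttest residuals after linear regression on the pretest."""
    if not -1 < pre_post_r < 1:
        raise ValueError("the pre-post correlation must lie strictly between -1 and 1")
    return posttest_sd * math.sqrt(1 - pre_post_r ** 2)


def effect_in_residual_units(effect_sd: float, pre_post_r: float) -> float:
    """Re-express an effect given in posttest SD units in residual SD units."""
    return effect_sd / residual_sd(1.0, pre_post_r)


# Hypothetical values: a larger standardized effect with a lower pre-post
# correlation (lower grade) and a smaller effect with a higher correlation
# (upper grade).
lower_grade = effect_in_residual_units(0.33, 0.60)
upper_grade = effect_in_residual_units(0.20, 0.90)

print(f"Lower grade effect: {lower_grade:.2f} residual SDs")
print(f"Upper grade effect: {upper_grade:.2f} residual SDs")
```

With these illustrative inputs the upper grade effect, although smaller in
standard score form, is slightly larger once the pre-post correlation is
taken into account.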

Following are five hypotheses regarding the decrement in SGE as grade
increases. It is highly likely that several of these alternative explanations
combine to account for cross grade differences. Although the first two
hypotheses are intuitively more appealing than the others, much more study of
the merits of each explanation is recommended.

Domain Expansion Hypothesis: With each increase in grade the relevant domain
(e.g., reading or math) expands in terms of the number of concepts encompassed
by the domain. The result of an expanding domain is that a fixed number of
items will be less and less representative; proportionately fewer items can
be allocated to any given span of concepts and objectives. As the range of
concepts and objectives covered by a test increases, the SGE decreases and
edumetric validity is reduced (cf. Carver, 1974). The poorer the match between
what is taught at a given grade level and what is tested, the less sensitive
the test is to growth, and the lower the SGE.

Shifting Constructs Hypothesis: The levels of some tests are not well
articulated, and with each succeeding grade stable organizing influences
other than reading or math achievement increasingly determine students'
scores on norm-referenced tests. For example, if reasoning ability becomes
progressively more confounded with reading and math achievement scores as
grade increases, and reasoning ability grows at an increasingly lower rate
than reading and math achievement, then the confounded reading and math SGEs
would be expected to decline as confounding increases. As what is measured
by norm-referenced reading and math tests changes, the edumetric validity of
these tests may be reduced.

² Perhaps a more methodologically defensible metric would be the standard
deviation of the pre-post residuals. This metric should be comparable
across grades since it is adjusted for pre-post correlations.

Learning Curve Hypothesis: The deteriorating SGE is due to an actual slowing
in the rate of learning, similar to the way growth in height slows from birth
to eighteen years of age. According to this hypothesis younger students have a
greater capacity for learning and this capacity deteriorates with age.

Unequal Interval Hypothesis: Standard deviation units are not equal intervals
across grade. Imagine a rubber band marked into ten equal intervals
representing the one standard deviation at second grade on the MAT Total
Reading. Now imagine the rubber band stretched to the point that the distance
between any two marks is equal to the entire length of the unstretched rubber
band. In this way we can see how growth during the seventh grade on the MAT
Total Reading (SGE = .10) might equal growth during the second grade (SGE = 1.0).
If this hypothesis were accepted, the validity of cross grade comparisons
would be questionable, but so would just about all other comparisons of
interest in educational evaluation.

Instructional Emphasis Hypothesis: Upper grade teachers do not emphasize
reading and math as much as lower grade teachers, and as a consequence
students learn less and subsequently show less growth on norm-referenced
reading and mathematics tests. As upper grade teachers concentrate less on
reading and math instruction than primary grade teachers, the SGE decreases.

The five hypotheses are rank ordered from most likely to least likely
(in our opinion) as explanations for the observed decrement in SGE as grade
increases. The first two hypotheses state that as grade increases, the
edumetric validity of NRTs decreases. The remaining three hypotheses offer
explanations which, although not related to the edumetric properties of NRTs,
cannot be discounted without further research. At present all the hypotheses
and the rank ordering are exercises in speculation. However, we are confident
that variation in the SGE across grades represents an important phenomenon
which may have implications for both policy makers and evaluation specialists.
Until the grade-to-grade fluctuations in SGE are better understood, researchers
might refrain from sweeping policy recommendations based upon cross grade
comparisons of norm-referenced achievement test scores.

Test-to-Test Variation

Almost as striking as the grade-to-grade variations in full year SGEs
within a test are the test-to-test differences within a grade. Examination
of Table 2 reveals numerous instances of SGEs being thirty to forty percent
higher for some tests than for others. When we shift focus to school-year
SGEs (see the table of school-year SGEs) the differences across tests are
even more dramatic. School-year SGEs frequently vary among tests by as much
as fifty to sixty percent, with isolated instances of SGEs for some tests
being three to six times as large as those of other tests.

All other things being equal, the higher the match between what is
learned and what is tested (i.e., the higher the edumetric validity) the higher
the SGE. An SGE near zero means that either nothing was taught, or something
was taught but nothing was learned, or the test did not reflect what was
taught and/or learned. A large SGE suggests that something was learned and
the test reflects well whatever was learned. Presumably, criterion-referenced
tests are superior to norm-referenced tests precisely because they provide a
better match between what is taught/learned and what is tested. The SGE may
provide a simple index for evaluating the claims made on behalf of criterion-
referenced tests that they are superior evaluation tools. If CRTs demonstrate
higher SGEs than NRTs, then these claims are likely valid. (The last section
on the edumetric ratio addresses this issue more thoroughly.) A properly
developed CRT should have greater fidelity to the curriculum and, consequently,
larger SGEs. The SGE may be an effective means of assessing, a priori, tests'
probable sensitivity to instruction.

We offer four hypotheses for the variation in SGEs across tests. Again
we order the hypotheses in terms of our present thinking regarding the
probability that each hypothesis will be sustained in future studies.

Edumetric Hypothesis: Norm-referenced tests differ in the extent to which
they reflect stable between-individual differences (Carver's psychometric
dimension) and the extent to which they reflect within-individual growth
(Carver's edumetric dimension). A test may possess exemplary psychometric
properties (e.g., high internal consistency and a good p value distribution)
but be insensitive to what students learn over a given treatment interval.
Such a test will have a low SGE but be otherwise indistinguishable from other
norm-referenced tests. The reader is encouraged to re-examine Table 2 in
light of this hypothesis.

Procedures Hypothesis: Test publishers use vastly different approaches to
interpolation/extrapolation and make different assumptions regarding summer
growth, thus artificially creating SGE differences. The fact that full
year SGEs are much more comparable between tests than are school-year or
summer SGEs suggests that publishers differ considerably in the assumptions
they make about summer growth.

Norm Group Hypothesis: The composition of the norm groups for the various
tests differs to such an extent that the SGEs are affected. Suppose, for
example, that the Stanford Achievement Test (SAT) norm group was substantially
brighter than the Metropolitan Achievement Test (MAT) norm group. The result
would be that the Stanford Achievement Test norms would reflect more growth
and, consequently, the SGEs for the SAT would be larger than those for the MAT.
We should note that findings from the Anchor Test Study, for at least four
of the tests considered in this paper, do not account for the large SGE
differences across tests.

Cohort Hypothesis: Although the norm groups for the various tests were
selected in essentially similar ways, because the tests were normed in
different years the samples may have differed in rate of achievement.
Teachers are fond of claiming that, like fine wines, there are "vintage
years" in which a particular group of students just seems brighter. However,
the pattern of SGE differences across tests (taking into consideration the
year each test was normed) is not consistent with this hypothesis.

Of the four hypotheses just presented, the procedures and edumetric
hypotheses seem most compelling. The fact that full-year SGEs are substantially
more comparable across tests than either school-year or summer SGEs suggests
that publishers may make different assumptions about what students learn during
the summer period. Apparently publishers of the Stanford Achievement Test
assume that very little reading or math achievement growth should be expected
of a fiftieth percentile student, whereas publishers of the MAT seem to assume
a large amount of summer growth.³ It seems probable that evaluation findings
will vary depending upon which test is used, how closely different publishers'
assumptions regarding summer growth coincide with empirical findings, and
whether fall to spring or spring to spring testing dates are employed.

According to the edumetric hypothesis, some norm-referenced tests are more
sensitive to student growth in reading and math than other tests. Those tests
with low SGEs measure well the between-individual differences which become
more and more stable as students get older, but do a relatively poor job of
measuring what students learn during a particular treatment interval. Most
users of NRTs, particularly evaluation specialists, are primarily interested
in measuring achievement growth. The SGE differences across tests seem to
indicate that commonly available NRTs differ considerably in their edumetric
validity, i.e., sensitivity to instruction.

The implications for educational evaluation of sustaining the edumetric
hypothesis are substantial indeed. First of all, assuming the validity of
this hypothesis, it is little wonder that most of our school effects studies
have accounted for such minuscule proportions of variance with instructional
process measures (cf. Cooley and Lohnes, 1976). The problem may not rest with
so-called "weak treatments" but rather with measurement instruments that are
systematically biased against showing either significant treatment-control
differences or substantial process-outcome relationships. When the SGE is
as small as .15 standard deviations, as is the case with several tests at the
eighth grade level, is it any wonder we find very few "exemplary" eighth
grade reading and math programs or that Coleman et al. (1966) could find
so few school variables that correlated with STEP Reading Test scores?
Similarly, it is perhaps no coincidence that at those grade levels where the
SGEs are largest and, presumably, the edumetric validity of the tests is
highest, we find a higher frequency of "exemplary" projects. An evaluation
study that employs an NRT with a low SGE may be a priori doomed to add yet
another conclusion of "no significant difference" to the literature on
school effects.

³ The fact that both the Stanford and Metropolitan claim to have empirically
determined fall and spring norms makes the substantial differences in summer
SGEs for these two tests all the more puzzling.

Summer Loss Phenomenon

Several recent studies have highlighted the fact that Title I students
achieve above expectation during the regular school year and lose in relative
standing during the summer months (Pelavin and David, 1977; Stenner et al.,
1977). Title I projects that use fall to spring testing dates often report
substantial treatment effects, whereas projects that use fall to fall or
spring to spring testing dates often report no treatment effects (Pelavin
and David, 1977). In general, there has been limited appreciation for
the different conclusions regarding treatment effects that result from
simply varying testing dates. For example, tentative procedures in the
OE Title I Evaluation System call for aggregating treatment effects without
regard for testing dates. Similarly, the Joint Dissemination Review Panel
typically evaluates reported treatment effects without considering testing
dates.

Table 4 presents standardized growth expectations for the summer
period (spring to fall). Except for the SAT, all tests exhibit substantial
growth expectations over the summer period. We suggest that an edumetrically
valid achievement test should have a large SGE over the school year and a
small summer SGE. However, since the size of both school-year and summer
SGEs can be manipulated by making different assumptions about summer growth,
the data presented in this paper cannot speak directly to this point. If
empirical data could be collected at three points in time (fall, spring,
fall) for all commonly used NRTs, then the ratio of summer growth to school-
year growth might address the question of comparative edumetric validity.
Under such an analysis, when SGEs for the summer period approach or exceed
SGEs for the school year, a test's sensitivity to instruction must be
questioned. Large summer SGEs would suggest that the construct being
measured by the test evidences growth whether or not the student is in school.
Such an instrument would not only be relatively less sensitive to instruction-
related achievement growth but would also presumably be insensitive to special
project treatment effects. Again we emphasize that given the lack of multiple
empirical norming points, the summer to school-year ratios given in Table 5
may simply reflect variation in publishers' assumptions about summer growth.

The large summer SGEs exhibited by four of the five tests examined in
this paper raise questions about how much of the reported summer loss among
Title I students is due to absolute loss in achievement and how much is due
to assumptions made by publishers. If Title I students actually lose raw
score points over the summer period then we must conclude that there is an
absolute loss in acquired skills and knowledge. If, however, there is no
raw score change from spring to fall, then the Title I summer loss is relative
rather than absolute and is a function of publisher assumptions. Discussions
with other researchers studying this phenomenon suggest that there is some
doubt as to whether the absolute achievement loss among Title I students
is as large as is commonly believed. According to the arguments presented
in this paper, the amounts of both absolute and relative loss may depend
upon the NRT employed in the evaluation.
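
The distinction between relative and absolute loss can be made concrete with
a small sketch. Everything in it is an assumption made for illustration: the
function names, the normal-curve conversion of percentiles to Z scores, and
above all the two norm tables, which are invented and do not come from any
published test. The relative component is the drop in standing that an
unchanged raw score would suffer between the spring and fall norms (the
summer SGE at that score); the absolute component is the further drop
attributable to raw score points actually lost.

```python
from statistics import NormalDist


def percentile_to_z(percentile: float) -> float:
    """Normal-curve Z score for a national percentile between 0 and 100."""
    return NormalDist().inv_cdf(percentile / 100)


def decompose_summer_loss(spring_raw, fall_raw, spring_norms, fall_norms):
    """Split a spring-to-fall drop in standing into relative and absolute parts.

    spring_norms and fall_norms map raw scores to national percentiles.
    """
    z_spring = percentile_to_z(spring_norms[spring_raw])
    z_fall_unchanged = percentile_to_z(fall_norms[spring_raw])
    z_fall_actual = percentile_to_z(fall_norms[fall_raw])
    return {
        "relative": z_spring - z_fall_unchanged,      # due to the norms alone
        "absolute": z_fall_unchanged - z_fall_actual,  # due to lost raw score points
        "total": z_spring - z_fall_actual,
    }


# Hypothetical raw-score-to-percentile tables (illustrative values only).
SPRING_NORMS = {38: 25, 40: 30}
FALL_NORMS = {38: 20, 40: 24}

no_raw_change = decompose_summer_loss(40, 40, SPRING_NORMS, FALL_NORMS)
two_points_lost = decompose_summer_loss(40, 38, SPRING_NORMS, FALL_NORMS)

for label, parts in (("same raw score", no_raw_change), ("two points lost", two_points_lost)):
    print(label, {name: round(value, 2) for name, value in parts.items()})
```

When the raw score is unchanged the absolute component is zero and the whole
loss is a function of the norms, which is the situation described in the
paragraph above.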

The Edumetric Ratio

The fact that students do not attend school year round suggests a
means for computing edumetric validities for commonly used norm-referenced
and criterion-referenced tests. An edumetrically valid test, i.e., a test
which is sensitive to instruction, should evidence proportionately higher
SGEs during the school year than during the summer. If we assume a nine
month school year and a three month summer, then any test purporting to
measure what is taught in school (e.g., reading comprehension and math
computation) should evidence a ratio of "school-year SGE" to "summer period
SGE" larger than 3:1. (For convenience, we term this value the edumetric
ratio.) On the other hand, a test of nonverbal reasoning might evidence
an edumetric ratio near 3:1, indicating that nonverbal reasoning (or what
Cattell (1971) calls Fluid Ability) grows at a constant rate largely
unaffected by school experiences.⁴ Tests purporting to measure skills and
objectives taught in school which show edumetric ratios near "three" would
probably prove highly insensitive to treatment effects and might be expected
to evidence near zero correlations with variables similar to those employed
by Coleman et al. (1966).
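
The edumetric ratio as defined here is a simple quotient, and a brief sketch
may make the 3:1 benchmark explicit. The function names and the two SGE
values are hypothetical; the nine month school year and three month summer
are the assumptions stated in the text. The benchmark is the ratio that would
result if growth proceeded at a constant monthly rate regardless of
schooling.

```python
def edumetric_ratio(school_year_sge: float, summer_sge: float) -> float:
    """Ratio of school-year SGE to summer SGE."""
    if summer_sge <= 0:
        raise ValueError("the ratio is undefined for a zero or negative summer SGE")
    return school_year_sge / summer_sge


def constant_growth_benchmark(school_months: int = 9, summer_months: int = 3) -> float:
    """Ratio expected if growth were constant per month, in or out of school."""
    return school_months / summer_months


# Hypothetical SGEs for one test at one grade (illustrative values only).
ratio = edumetric_ratio(school_year_sge=0.75, summer_sge=0.125)
benchmark = constant_growth_benchmark()

print(f"Edumetric ratio {ratio:.1f}:1 against a benchmark of {benchmark:.0f}:1")
print("Exceeds constant-growth benchmark:", ratio > benchmark)
```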



⁴ Edumetric ratios computed separately for different socio-economic groups
might provide insight into how differences in out-of-school and in-school
experiences impact on achievement.




The edumetric ratio may also provide a means of externally validating
criterion-referenced test items. Typically CRT validation efforts rely
heavily on content analysis and judgments of curriculum experts regarding
the match between curriculum and what a test item presumably measures.
Edumetric validity of such items is assumed when judges agree on what concept
or objective an item is measuring. We suggest that rating consensus is
insufficient evidence to conclude that a test item is edumetrically valid.
One more methodologically defensible approach might be to compute edumetric
ratios on a set of items judged to be measuring a particular concept or
objective and include on the final instrument only those items with high ratios.

Lastly, a comparison of SGEs for a widely used achievement and ability
test offers some additional insights into the aptitude-achievement distinction
(Green, 1974). Judging from theory and publisher test descriptions, one
would expect achievement tests to have higher SGEs than ability tests. For
example, the Technical Manual for the Cognitive Abilities Test (Thorndike
and Hagen, 1971) states: "...The test can be characterized by the following
statements and these characteristics describe behavior that is important to
measure for understanding an individual's educational and work potential:
(1) The tasks deal with abstract and general concepts, (2) In most cases,
the tasks require the interpretation and use of symbols, (3) In large part,
it is relationships among concepts and symbols with which the examinee must
deal, (4) The tasks require the examinee to be flexible in his basis for
organizing concepts and symbols, (5) Experience must be used in new patterns,
and (6) Power in working with abstract materials is emphasized, rather than
speed" (p. 25). Contrast the above description with that given in the technical
manual for the Iowa Test of Basic Skills: "...The ITBS provides for
comprehensive and continuous measurement of growth in the fundamental skills:
vocabulary, reading, the mechanics of writing, methods of study, and
mathematics. These skills are crucial to current day-to-day learning
activities as well as to future educational development" (p. 3). In the
ability test manual, phrases such as "educational potential," "general
concepts," and "interpretation and use of symbols" are used, whereas the
achievement test manual uses such terms as "growth," "fundamental skills,"
"diagnosis," and "skill improvement." Clearly the impression one gets from
these two manuals is that the Cognitive Abilities Test measures something
more stable and less sensitive to school experiences than the ITBS, an
impression which is not sustained by the SGE data.
Table 7 contrasts SGEs for the Cognitive Abilities Test and the ITBS.
A first observation is that ITBS-Reading SGEs are comparable to CAT-Verbal
SGEs. Thus, the ITBS-Reading appears to be almost as sensitive to instruction
as the CAT-Verbal. Whether comparability between the two is due to the
fact that the achievement test is actually more an ability test, or the ability
test is just a relabeled achievement test, or the distinction between verbal
ability and reading is a sham, merits further study.⁵ One conclusion appears
disconcertingly clear: the ITBS-Reading appears to be only slightly more
edumetrically valid than the CAT-Verbal. How serious this predicament is
depends on whether one elects to fault the CAT for being too much like an
achievement test or the ITBS for being too much like an ability test.

The CAT-Quantitative appears to be less edumetrically valid than the
ITBS-Total Math, but more sensitive to instruction than the CAT-Nonverbal.
Since the CAT-Quantitative items loaded highly on the nonverbal factor and
failed to define a quantitative factor (Thorndike and Hagen, 1971, p. 32),
one is left with the possibility that the Quantitative items are simply a
mixture of items similar to ITBS-Total Math items and nonverbal reasoning
items. Had the Quantitative Scale held more true to its label, we suspect
that the SGEs would more closely approximate those for ITBS-Total Math.
Finally, the CAT-Nonverbal evidences the lowest SGEs. Whether the nonverbal
growth expectations for the summer period are proportional to the school
year, indicating little school effect, is a question requiring further study.

A Prospectus For Research . • . 

A major thesis of this paper is that policy decisions based upon grade-to-grade
and test-to-test comparisons rest on a potentially shaky foundation.
If the SGE index is meaningful and the analyses based upon it are valid,
then a potentially large number of research findings merit re-examination.

Granting the far-reaching policy implications inherent in our assertions
and the need to establish quickly whether or not these assertions are valid,
we offer the following research agenda.
We are not suggesting that just because two tests have similar SGEs they are
necessarily measuring the same thing. We are suggesting that evidence of
comparable SGEs, when added to information that disattenuated inter-test
correlations approach 1.00, provides a pretty strong case for the fact that
the two tests measure the same psychological construct.
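
For readers who want the disattenuation step spelled out, the following minimal sketch applies the classical correction for attenuation (observed correlation divided by the square root of the product of the two reliabilities). The function name and the numbers in the usage lines are hypothetical illustrations, not values from any test manual.

```python
import math


def disattenuated_correlation(r_xy, r_xx, r_yy):
    """Classical correction for attenuation.

    r_xy: observed correlation between test X and test Y
    r_xx: reliability of test X
    r_yy: reliability of test Y
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical illustration only: an observed correlation of .72 between
# two tests with reliabilities of .81 and .64 disattenuates to about 1.00.
corrected = disattenuated_correlation(0.72, 0.81, 0.64)
print(f"disattenuated r = {corrected:.2f}")
```

A corrected value near 1.00, taken together with comparable SGEs, is the pattern the note above treats as evidence that two tests tap the same construct.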
- Submit the SGE concept to comprehensive analysis by measurement
  specialists focusing upon the conceptual basis for the index
  and assumptions underlying its computation.

- Compute SGEs for all subtests of commonly used achievement
  and ability tests marketed during the past twenty years, and
  compare SGEs across grades and subtests. Some form of multi-method,
  multi-trait analysis might prove useful in such a substudy.

- Conduct a comprehensive content analysis of commonly used NRTs
  to determine the extent to which item content, type, or format
  contribute to variability in SGEs across tests (suggested by
  Joe Haenn: personal communication).

- Conduct a meta-analysis of reported treatment effects across a
  wide range of studies to determine whether treatment effects are
  correlated with SGEs. Preliminary investigations suggest that
  this may be a particularly fruitful area for further investigation.

- Conduct a logical and empirical analysis of the summer loss
  phenomenon found among Title I students. Estimate, if possible,
  what proportions of the loss are relative and absolute, and
  examine ways these proportions differ depending upon which NRT
  is used. Also conduct an item analysis to determine which skills
  evidence the largest losses over the summer.

- Conduct a preliminary investigation of the relationship between
  shifting edumetric validity and the Scholastic Aptitude Test
  score decline.

- Compute SGEs for a sample of criterion-referenced tests and
  investigate the claim that the SGE provides a useful index
  for comparing edumetric validities of CRTs and NRTs.

- Conduct a logical and empirical examination of the effects of
  out-of-level testing on the edumetric validity of NRTs.
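
The meta-analytic item in the agenda above amounts to correlating each study's reported treatment effect with the SGE of the test and grade in which it was measured. The sketch below shows one simple way this could be done with a plain Pearson correlation. It is an illustration under our own assumptions (one effect and one SGE per study, no weighting by sample size), and the function name `pearson_r` is ours.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences.

    In the intended use, xs would hold the SGE for each study's test and
    grade, and ys the treatment effect reported by that study, both in
    standard deviation units.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return sxy / math.sqrt(sxx * syy)
```

A clearly positive correlation would be consistent with the hypothesis that reported effects track the growth expectation built into the test. A fuller analysis would presumably weight studies by sample size and precision.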

The above research agenda will first address the utility and validity
of the SGE concept and then proceed to examine selected implications of
sustaining the edumetric hypothesis. The current nationwide interest in
basic skills testing makes the topic of the proposed research particularly
policy-relevant at this time.

TABLE 1

RAW SCORE TO PERCENTILE TABLE FOR BEGINNING AND END
OF FIRST GRADE ON CTBS, LEVEL B

Total Reading

              Beginning of First Grade    End of First Grade
Percentile    Raw Score                   Raw Score

    99        73-84                       84
    98        68-72                       84
    97        65-67                       84
    96        61-64                       84
    95        59-60                       84
    94        57-58                       83
    93        55-56                       83
    92        53-54                       82
    91        52                          82
    ...
    50        31                          59
    49        31                          58
    48        31                          58
    47        31                          57
    46        31                          56
    45        30                          55
    44        29                          54
    43        29                          53
    42        29                          53
    41        29                          52
    ...
    10        20                          32
     9        19                          31
     8        18                          30
     7        18                          29
     6        18                          28
     5        18                          27
     4        [illegible]                 25-26
     3        16                          24
     2        15                          21-23
     1        0-14                        0-20
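
The Table 1 entries can be turned into a growth expectation with a short calculation. The sketch below is a hedged illustration, not the authors' documented procedure. It assumes that the SGE is obtained by holding the beginning-of-grade median raw score fixed, reading that same raw score's percentile in the end-of-grade norms, and converting both percentiles to normal-curve z-scores. The function names are hypothetical.

```python
from statistics import NormalDist

# End-of-first-grade norms from Table 1 (CTBS Level B, Total Reading),
# lower tail only: raw score -> percentile.
END_OF_GRADE_PERCENTILE = {32: 10, 31: 9, 30: 8, 29: 7, 28: 6, 27: 5}

# Raw score at the 50th percentile at the beginning of first grade (Table 1).
MEDIAN_RAW_AT_START = 31


def percentile_to_z(percentile):
    """Convert a percentile rank (1-99) to a z-score, assuming normality."""
    return NormalDist().inv_cdf(percentile / 100)


def standardized_growth_expectation(pre_percentile, post_percentile):
    """Drop in standing, in SD units, for a student whose raw score does not change."""
    return percentile_to_z(pre_percentile) - percentile_to_z(post_percentile)


post_percentile = END_OF_GRADE_PERCENTILE[MEDIAN_RAW_AT_START]
growth = standardized_growth_expectation(50, post_percentile)
print(f"raw score {MEDIAN_RAW_AT_START}: 50th percentile at the start, "
      f"{post_percentile}th at the end, SGE of about {growth:.2f} SD")
```

Under these assumptions a first grader who starts at the median and answers no additional items correctly falls to the 9th percentile by the end of the year, so holding the median position requires roughly 1.3 standard deviations of growth.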



TABLE 2

STANDARDIZED GROWTH EXPECTATIONS FOR SELECTED TESTS
(STANDARD DEVIATION UNITS)

Rows: Spring to Spring Grade Period, 1.7 - 2.7 through 8.7 - 9.7
Columns, Total Reading: Stanford, ITBS, CAT-77, CTBS, Metropolitan
Columns, Total Math: Stanford, ITBS, CAT-77

[Cell values are illegible in the source.]

Stanford was normed in October and May.
ITBS was normed in November.
CAT-77 was normed in November and May.
CTBS was normed in April.
Metropolitan was normed in October and April.

? - Standard procedure for computing SGE could not be followed given that
spring to spring norms are not available for indicated levels of the
[remainder of note missing in the source]



TABLE 3

STANDARDIZED GROWTH EXPECTATIONS FOR SELECTED TESTS
SCHOOL-YEAR
(STANDARD DEVIATION UNITS)

Rows: Fall to Spring Grade Period, 3.1 - 3.7 through 9.1 - 9.7
Columns, Total Reading: Stanford, ITBS, CAT-77, Metropolitan

[Cell values are illegible in the source.]

? - Could not compute these values from information given in
publishers' manuals.



GROWTH EXPECTATIONS (SUMMER) FOR SELECTED TESTS
(STANDARD DEVIATIONS)

Rows: Spring to Fall Grade Period, 2.7 - 3.1 through 8.7 - 9.1
Columns, Total Reading: Stanford, ITBS, CTBS
Columns, Total Math: Stanford, ITBS, CAT-77

[Cell values are illegible in the source.]

? - Could not compute these values from information in the CTBS manuals.



RATIO OF SUMMER GROWTH EXPECTATION TO SCHOOL YEAR EXPECTATION

Rows: Grade
Columns: Stanford, ITBS, CAT-77, CTBS, Metropolitan

[Cell values are illegible in the source.]

* Growth expectation over the summer period is zero.
? - Could not compute the value from information provided in the MAT manual.



TABLE 6

STANDARDIZED GROWTH EXPECTATIONS FOR
THE IOWA TEST OF BASIC SKILLS AND THE
COGNITIVE ABILITIES TEST

                ITBS             Cognitive     ITBS      Cognitive       Cognitive
                Reading          Abilities     Total     Abilities       Abilities
Grade Period    Comprehension    Test          Math      Test            Test
                                 Verbal                  Quantitative    Nonverbal

3.7 - 4.7       .74              .74           .91       .71             .38
4.7 - 5.7       .74              .61           .71       .52             .38
5.7 - 6.7       .57              .52           .67       .38             .[?]3
6.7 - 7.7       .47              .41           .49       .36             .20
7.7 - 8.7       .41              .30           .47       .36             [illegible]
8.7 - 9.7       .33              .30           .[?]3     .28             .23